DS-7331 Machine Learning Project 3

Airbnb Clustering Tasks

Allen Miller, Ana Glaser, Jake Harrison, Lola Awodipe

https://nbviewer.jupyter.org/github/allenmiller17/SMU_7331_ML1_Projects/blob/main/PROJECT%203%20ML7331.ipynb

Introduction

Our project utilizes Airbnb data from kaggle.com.

The main objective of this project is to perform cluster analyses to simplify, and potentially make more efficient, the classification models from the previous project.

Table of Contents

  0. Libraries and Loading Data
  1. Business Understanding
  2. Data Understanding
    2.1 Data Meaning
    2.2 Data Quality
      2.2.1 Missing Values
      2.2.2 Outliers
    2.3 Visualizing Important Features
  3. Modeling and Evaluation
    3.1 KMeans Clustering Analysis
      3.1.1 Selecting Optimal K-Clusters
      3.1.2 Internal and External Validation Measures
      3.1.3 Results
    3.2 Hierarchical Clustering Analysis
      3.2.1 Selecting Optimal Clusters
      3.2.2 Validation Measures
      3.2.3 Models
      3.2.4 Observations
    3.3 DBScan Clustering Analysis
      3.3.1 Selecting Optimal Clusters
      3.3.2 Validation Measures and Results
    3.4 Comparison of All Clustering Techniques
    3.5 Visualize the Results
    3.6 Ramifications
  4. Deployment
  5. Exceptional Work - Applying Cluster Analyses to Classification Models

0. Libraries and Loading Data

1. Business Understanding

For our project, we decided to use Airbnb data from six major cities in the United States, obtained from kaggle.com. Our objective is to classify the type of property based on attributes such as city, number of reviews, bathrooms, bedrooms, and the number of people the property accommodates.

First we will evaluate the various clustering techniques by visually inspecting the separation and interpretability of the clustering models. We will also compare the effectiveness of our clusters by evaluating the silhouette score for each technique.

Then we will append the cluster labels to the dataset as a feature to determine whether the classification task is improved by these added regressors.

To assess the effectiveness of our classification, we will look at accuracy, precision, and recall, and evaluate the confusion matrix results.
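As a sketch, the kind of evaluation we have in mind could look like the following (the label vectors here are purely illustrative placeholders, not our Airbnb data):

```python
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# Hypothetical true vs. predicted property-type labels (illustrative only)
y_true = [0, 0, 1, 1, 2, 2, 2, 1]
y_pred = [0, 1, 1, 1, 2, 2, 0, 1]

acc = accuracy_score(y_true, y_pred)                     # 6 of 8 correct -> 0.75
prec = precision_score(y_true, y_pred, average="macro")  # averaged over classes
rec = recall_score(y_true, y_pred, average="macro")
cm = confusion_matrix(y_true, y_pred)                    # rows = true, cols = predicted
```

The confusion matrix makes per-class errors visible in a way the single accuracy number does not.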

2. Data Understanding

The number of reviews feature showed data that was heavily skewed to the left. Because normality is not an assumption for clustering or classification tasks, we were not concerned about having normally distributed feature observations, and we chose not to transform any of these variables.

The neighborhood attribute made our data very sparse and increased run time dramatically. When we tested the models with and without it, the difference was negligible, so trading a small amount of model performance for much faster run time seemed fair, and we eliminated that variable.

2.1 Data Meaning

Adding the unlogged version of the price will help our team interpret the data and give us an idea on how dispersed our data really is.

2.2 Data Quality

2.2.1 Missing Values

Given the volume of our data, we were able to remove incomplete records with missing values and still retain a significant number of records to evaluate.

We also evaluated the number of unique values in the categorical variables and decided to eliminate the neighborhood attribute, since its 590 distinct values made model run time very slow while contributing only a modest amount of accuracy.
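A minimal sketch of this cleaning step (the toy frame, its column names, and the cardinality threshold are assumptions for illustration, not our actual schema):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Airbnb listings table
df = pd.DataFrame({
    "price": [100.0, 250.0, np.nan, 80.0, 120.0],
    "bedrooms": [1.0, 2.0, 1.0, np.nan, 3.0],
    "neighbourhood": ["A", "B", "C", "A", "D"],
})

# Drop incomplete records; with a large enough dataset, plenty of rows survive
clean = df.dropna()

# Inspect categorical cardinality and drop very high-cardinality columns
# (in our data, neighborhood had 590 distinct values)
if clean["neighbourhood"].nunique() > 2:  # threshold chosen for this toy example
    clean = clean.drop(columns=["neighbourhood"])
```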

2.2.2 Outliers

As displayed in the graphic below, property type was a very skewed attribute, so we reduced it to the top 5 property types and labeled the rest as other.

We also encoded the categorical variables reserved for our classification models, and transformed the longitude variable into a region labeled east or west.

We then dropped all of the other columns that would not be used in the subsequent models, such as property descriptions, as well as those that lacked predictive power, as demonstrated in our previous experiments.
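The preprocessing described above might be sketched as follows (the toy values, the -100 longitude split point, and the column names are assumptions for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "property_type": ["Apartment", "House", "Apartment", "Castle", "Loft",
                      "House", "Condominium", "Townhouse", "Boat", "Apartment"],
    "longitude": [-118.2, -73.9, -87.6, -122.4, -71.1,
                  -104.9, -118.3, -73.9, -122.4, -87.7],
})

# Keep the 5 most frequent property types; relabel the long tail as "Other"
top5 = df["property_type"].value_counts().nlargest(5).index
df["property_type"] = df["property_type"].where(df["property_type"].isin(top5), "Other")

# Collapse longitude into an east/west region (split point is illustrative)
df["region"] = (df["longitude"] > -100).map({True: "east", False: "west"})

# One-hot encode the categoricals reserved for the classification models
encoded = pd.get_dummies(df, columns=["property_type", "region"])
```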

2.3 Visualizing Important Features - Data Understanding 2

Observing price across cities, one of our key regressors, there does not appear to be significant separation in property price between the various cities.

Observing the data across properties, it is interesting to note that they all share the same distribution across key elements; the KMeans clusters highlighted the commonalities in the property attributes.

When plotting the data along number of bathrooms, beds, and price, it formed a more circular shape, giving a more compact representation of the property types. In the DBScan clustering, we observed a similar delineation of the data according to its clusters.

3. Model and Evaluation 1 - Train and Adjust parameters

3.1 K-Means Clustering Analysis

3.1.1 Selecting Optimal K-Clusters
3.1.1.1 Internal validation measures

There is room for interpretation on which k is the best option. The elbow method is difficult to apply in this instance because there is no specific point where the elbow starts; one could argue it begins at 3, 6, 9, or 11. Because of that, our team deferred to the yellowbrick package, which explicitly identified the optimal k as 6. When used properly, this package selects an optimal k based on the distortion score.

To determine the optimal number of clusters, we compared the elbow graphic and the yellowbrick function against the per-cluster silhouette score chart above. Although the silhouette score is greatest at 2, we disregarded that value because a two-cluster split is frequently a trivial solution that the silhouette score favors. We also passed over the 3-5 range and moved forward with the 6 clusters selected by the yellowbrick package, since its choice is optimized by embedded logic on the distortion score (Reference 1).

Reference 1: https://medium.com/data-science-community-srm/machine-learning-visualizations-with-yellowbrick-3c533955b1b3

Evaluating the number of records per cluster, we feel the distribution is adequate across the various clusters. We have appended the cluster value to the data for future evaluation and classification functions.
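A compact sketch of this selection-and-append workflow on synthetic blobs (the data, the candidate k range, and the final k=3 are placeholders; on our real data the yellowbrick-selected k was 6):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)
# Three well-separated synthetic blobs standing in for the scaled features
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 3)) for c in (0, 4, 8)])

# Compare distortion (inertia) and silhouette across candidate k values
scores = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    scores[k] = (km.inertia_, silhouette_score(X, km.labels_))

# Fit the chosen model and append its labels as a new feature column
final = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
X_aug = np.column_stack([X, final.labels_])
```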

3.1.1.2 External validation measures

The intercluster distance map embeds the cluster centers in two dimensions while preserving the distances between centers. The cluster bubbles are sized by the number of instances within each group.

Just because there is overlap in the two-dimensional space, does not imply they overlap in the original feature space. However, looking at the three-dimensional graph of the selected features using plotly, one could argue against this notion.

3.1.3 Results

Clusters 3, 5, 0, and 1 show a clear separation and a relationship between price and review score rating. For example, if we compare cluster 0 in purple to cluster 5 in green, they sit in roughly the same price range but are separated by their ratings.

If we compare clusters 2 and 4, the main distinction is the number of reviews: listings with more reviews are separated from the rest despite having similar prices and ratings.

Cluster 0 (yellow/orange) - medium price, high rating, high number of reviews
Cluster 1 (purple) - low price, high rating, low number of reviews
Cluster 2 (red) - medium-low price, high rating, low number of reviews
Cluster 3 (green) - low to medium-low price, lower rating, low number of reviews
Cluster 4 (dark blue) - medium-high price, high rating, low number of reviews
Cluster 5 (sky blue) - high price, high rating, low number of reviews

When observing the data across clusters, it was evident that city was not a distinguishing factor. As demonstrated in the boxplots below, the separation among clusters is more pronounced along price, rating, and number of reviews.

3.2 Hierarchical Clustering Analysis

3.2.1 Selecting Optimal Clusters

Selecting the optimal number of clusters in hierarchical clustering comes down to choosing the height (distance) at which to cut the tree so that it yields the best separation.

3.2.2 Validation Measures

We iterated through the various linkage techniques for the hierarchical clusters and compared the cophenetic scores.

We evaluated the following linkage methods: single (the minimum distance between clusters), complete (the maximum distance), average (the average of the distances between all pairs), ward (minimum within-cluster variance), and median (the median distance between clusters).
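A sketch of that comparison with scipy (the random data is a placeholder for our scaled listing features):

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))  # stand-in for the scaled listing features

# Cophenetic correlation per linkage method: how faithfully each dendrogram
# preserves the original pairwise distances
dists = pdist(X)
coph = {}
for method in ("single", "complete", "average", "ward", "median"):
    Z = linkage(X, method=method)
    coph[method], _ = cophenet(Z, dists)
```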

3.2.3 Models

We compared the cophenetic coefficient per linkage method above; we will now evaluate the clustering techniques visually in the dendrograms below.

We will now truncate the various hierarchical clusterings to assign cluster labels for visual representation.

The distance range for the ward method is quite high compared to that of the average linkage, so we will truncate our clusters at distance = 100, which results in 5 clusters.
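A sketch of truncating a ward dendrogram at a fixed distance (the synthetic blobs and the threshold of 20 are illustrative stand-ins; on our data the cut is made near distance 100):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(1)
# Three loose synthetic groups standing in for the listing data
X = np.vstack([rng.normal(loc=c, scale=1.0, size=(40, 3)) for c in (0, 6, 12)])

Z = linkage(X, method="ward")

# Cut the tree at a fixed cophenetic distance; every merge above the
# threshold is undone, leaving one flat cluster label per observation
labels = fcluster(Z, t=20, criterion="distance")
```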

The ward hierarchical linkage approach, which minimizes within-cluster variance, is most closely aligned with the separation we expect and is very similar to the results we saw with k-means clustering.

We will now evaluate the clusters when we apply the average linkage technique.

3.2.4 Observations - Hierarchical Clustering

After visually comparing the various hierarchical clustering results and evaluating the cophenetic index, we determined which linkage method was most appropriate for our data.

The ward linkage method clusters the data around common properties, which makes sense because it optimizes within-cluster variance. The average linkage method clustered a majority of the data points together and highlighted the anomalies.

For example, the clusters outside of the main group exhibited rare combinations of traits, such as an unusually small number of reviews or an extremely high price point. We think average linkage is better for anomaly detection, while ward linkage is more appropriate for finding commonalities in the data.

3.3 DBScan Clustering Analysis

We will now evaluate the DBScan clustering technique; we illustrated the eps values based on the min-point value.

Based on the graph above, we optimized the eps value and determined that the best min points is 2 and the best eps is closest to 1.

When we look at the clusters, a majority of the points fall in cluster '-1', which DBSCAN treats as noise. Once that cluster is removed, a distinct separation among the remaining clusters emerges. Further clustering could be conducted within the noise group to determine what drives its commonality. It is interesting to note that the 'average' linkage method in the hierarchical clustering exhibited similar behavior.

3.3.1 Selecting Optimal Clusters

Evaluating DBScan, we used a k-neighbors graph to determine the optimal eps and min_points values. Once we produced the ideal number of points, we visually evaluated the separation identified by the DBSCAN clusters. There appeared to be a distinct separation based on the number of rooms and how many people a property accommodates; this technique used a different set of properties to differentiate the data.
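A sketch of the eps-selection workflow (the two synthetic blobs plus uniform noise, and the eps of 0.5, are placeholder choices; on our data the knee of the curve suggested eps near 1 with min points 2):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(2)
# Two dense synthetic groups plus scattered noise points
X = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.3, size=(50, 2)),
    rng.uniform(low=-2.0, high=7.0, size=(10, 2)),
])

# k-distance curve: sorted distance from each point to its nearest neighbor;
# the "knee" of this curve suggests a reasonable eps for min_samples=2
nn = NearestNeighbors(n_neighbors=2).fit(X)
distances, _ = nn.kneighbors(X)
k_dist = np.sort(distances[:, 1])

db = DBSCAN(eps=0.5, min_samples=2).fit(X)
labels = db.labels_  # -1 marks points DBSCAN treats as noise
```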

3.3.2 Results and Validation Measures

Observing the silhouette score, we can see that it is very close to 0 but slightly negative, which tells us that our clusters overlap. We can see this visually as well: cluster -1 must be omitted before clear separation between the remaining clusters appears. This indicates there is still room for improvement when applying DBSCAN to our data; given the shapes of data DBSCAN favors, we would be more effective using a clustering technique better suited to our data.

3.4 Comparison of all Clustering Techniques - Model and Evaluation 2

When we visually compared the clusters using the 3D plots, we felt that KMeans clustering was the most effective technique for separating our data. However, when we evaluated the silhouette scores side-by-side, we noticed that hierarchical clustering had a better silhouette score, while DBSCAN's score was dominated by the points it classified as noise.

We then had an opportunity to evaluate the data clusters side-by-side and the visual illustration, along with the silhouette score, helped us conclude that it is the hierarchical clustering technique, using the ward linkage method, that gives us the clearest separation of our data.

3.5 Visualize the Results - Model and Evaluation 3

3.6 Ramifications - Model and Evaluation 4

Cluster analyses can be performed in many ways, and the approach depends largely on the data scientist's preferences and how they want to frame the problem. There is no single right way to perform a cluster analysis, but there are guidelines to follow depending on the type of analysis. For instance, with K-Means clustering, one could use the elbow method on the distortion score to determine the 'optimal' k, or use the k whose silhouette score is closest to 1. Because these methods give different 'optimal' k values, the data scientists have to decide which to move forward with. Although fewer clusters can be easier to interpret, we felt that 6 clusters displayed the strongest separability in the k-means analysis.

Hierarchical clustering analysis requires the scientists to make somewhat arbitrary decisions, such as where to draw the line on the number of clusters based on multiple measures. When deciding which linkage method is best, we want to assess the cophenetic coefficient and prefer the highest value. However, when comparing the linkage method with the largest cophenetic coefficient to the one with the lowest, we saw that the ward method gave us better separability within the cluster groups, so we proceeded with it even though the average linkage method had the highest cophenetic coefficient. We believe this is case-by-case and depends on how the instances blend in the analysis. Once a method is chosen, we have to decide a cut-off for the number of clusters by truncating at a distance displayed in the dendrograms; for the ward linkage method, we selected a cut-off at 150 based on our team's judgment of the best clustering option. Alongside these two clustering methods, we also used DBScan clustering, which relies on different parameters to create the clusters, such as the 'eps' value and the minimum number of samples per cluster.

Of the three, DBScan clustering is the most data-driven, decision-based method. Because these clustering methods leave room for human error, one mistake could derail the analysis and produce less useful cluster groups for the tasks at hand. The best way to avoid these concerns is to dive deep into data understanding. Our team is confident we made the best decisions for our clustering analyses based on our experience with the data and our interpretation of the clusters within each method.

4. Deployment

How will our chosen model be usable by other parties?

5. Exceptional Work: K-Nearest Neighbors Classification

We incorporated the KMeans cluster attribute into our data and re-ran the KNN classification task we conducted in the last project.

After comparing the accuracy and performance metrics of the model augmented with the cluster value, we determined that the accuracy of the classification task was worse when using CLUSTER as a regressor than when we did not incorporate it at all.
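The augmentation experiment can be sketched as follows (the synthetic features, labels, and hyperparameters are placeholders for illustration, not our actual data or tuned models):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(3)
# Synthetic features and binary labels standing in for the listing data
X = rng.normal(size=(300, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Append the unsupervised KMeans label as an extra feature column
clusters = KMeans(n_clusters=6, n_init=10, random_state=3).fit_predict(X)
X_aug = np.column_stack([X, clusters])

X_tr, X_te, Xa_tr, Xa_te, y_tr, y_te = train_test_split(
    X, X_aug, y, random_state=3)

# Accuracy with and without the appended cluster feature
base = KNeighborsClassifier().fit(X_tr, y_tr).score(X_te, y_te)
aug = KNeighborsClassifier().fit(Xa_tr, y_tr).score(Xa_te, y_te)
```

Comparing `base` and `aug` side by side is how a cluster feature's contribution (or, as in our case, its lack of one) becomes visible.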